
Scalable and Fault Tolerant Computation with the Sparse Grid Combination Technique

Abstract

This paper continues to develop a fault tolerant extension of the sparse grid combination technique recently proposed in [B. Harding and M. Hegland, ANZIAM J., 54 (CTAC2012), pp. C394-C411]. The approach is novel for two reasons: first, it provides several levels at which one can exploit parallelism, leading towards massively parallel implementations; second, it provides algorithm-based fault tolerance, so that solutions can still be recovered if failures occur during computation. We present a generalisation of the combination technique of which the fault tolerant algorithm is a consequence. Using a model for the time between faults on each node of a high performance computer, we provide bounds on the expected error for interpolation with this algorithm. Numerical experiments on the scalar advection PDE demonstrate that the algorithm is resilient to faults on a real application. It is observed that the trade-off of recovery time to decreased accuracy of the solution is suitably small. A comparison with traditional checkpoint-restart methods applied to the combination technique shows that our approach is highly scalable with respect to the number of faults.
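The abstract does not spell out the generalised combination formula, so the following is only a minimal sketch of the idea. It assumes the standard inclusion-exclusion rule for combination coefficients over a downward-closed set of level multi-indices, and it is not necessarily the recovery procedure used in the paper. The function names `combination_coefficients` and `classical_index_set` are hypothetical. The sketch shows that when one component grid is lost, coefficients can be recomputed for the surviving grids without recomputing any solution.

```python
from itertools import product


def combination_coefficients(index_set):
    """Inclusion-exclusion coefficients for a downward-closed set of
    level multi-indices (assumed formula, not taken from the abstract):
        c_i = sum over j in {0,1}^d of (-1)^|j| * [i + j in index_set]
    Only nonzero coefficients are returned.
    """
    index_set = set(index_set)
    dim = len(next(iter(index_set)))
    coeffs = {}
    for i in index_set:
        c = 0
        for j in product((0, 1), repeat=dim):
            shifted = tuple(a + b for a, b in zip(i, j))
            if shifted in index_set:
                c += (-1) ** sum(j)
        if c != 0:
            coeffs[i] = c
    return coeffs


def classical_index_set(n, dim=2):
    """All multi-indices with |i| <= n: the classical combination index set."""
    return {i for i in product(range(n + 1), repeat=dim) if sum(i) <= n}


# Classical 2D combination at level 3: +1 on the diagonal |i| = 3,
# -1 on the diagonal |i| = 2.
full = classical_index_set(3)
coeffs_full = combination_coefficients(full)

# Simulate a fault: the solution on grid (2, 1) is lost. Removing a maximal
# index keeps the set downward closed, so coefficients can be recomputed
# for the surviving grids.
survivors = full - {(2, 1)}
coeffs_after_fault = combination_coefficients(survivors)
```

In this sketch the recovered combination uses only grids that were already computed, at the cost of a coarser approximation; this is the kind of trade-off between recovery time and accuracy that the abstract describes.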
